- Big data engineer who undertakes complex assignments, meets tight deadlines, and delivers superior performance. Possesses practical knowledge of data analytics and optimization. Applies strong analytical skills to inform senior management of key trends identified in the data. Operates with a strong sense of urgency and thrives in a fast-paced setting.
- 8+ years of overall experience in the design and deployment of data management and data warehousing projects, in various roles including Data Analyst on big data technologies.
- 4+ years of Hadoop experience in the design and development of big data applications involving Apache Hadoop MapReduce, HDFS, Hive, HBase, Pig, Oozie, Sqoop, Flume, and Spark.
- Experience with major Hadoop distributions, including Cloudera and Hortonworks.
- Knowledge of developing solutions around NoSQL databases such as MongoDB and Cassandra.
- Experience with performance optimizations such as using the distributed cache for small datasets, and partitioning and bucketing in Hive.
- Hands-on experience analyzing data using Pig Latin, HQL, and HBase.
- Strong knowledge of Hadoop, Hive, and Hive's analytical functions.
- Loaded datasets into Hive for ETL operations.
- Proficient in using big data ingestion tools such as Flume and Sqoop.
- Experience in importing and exporting data between HDFS and relational database management systems using Sqoop.
- Experience in handling continuous streaming data using Flume and memory channels.
- Good knowledge of executing Spark SQL queries against data in Hive.
- Well experienced in developing scripts using Hadoop, Sqoop, Flume, Impala, Aorta Framework, Automic, Apache Spark, Hive, Kafka, and Pig.
- Well versed in Scala, Python, HQL scripts, and streaming data.
- Extensive experience in data transformation and data mapping from source to target database schemas.
- Experienced in using Python for statistical analysis, along with MySQL and NoSQL databases, across various environments and business processes.
- Expertise in handling large datasets with R, Python, and libraries such as pandas, NumPy, and scikit-learn.
- Experienced in parallel processing to maximize calculation speed and efficiency.
- Well versed in version control tools such as Git and Bitbucket.
- Experience in data visualization, including producing tables, graphs, and listings with tools such as Tableau to support the understanding and explanation of data and final reports.
- Knowledge of predictive analysis and machine learning algorithms.
- Worked in an Agile environment, following strict Agile practices using Jira.
- Strong problem-solving and analytical skills, with the ability to make balanced, independent decisions.
- Good team player with strong interpersonal, organizational, and communication skills, combined with self-motivation, initiative, and project management abilities.
- Strong ability to handle multiple priorities and workloads, and to understand and adapt quickly to new technologies and environments.
Platforms: Cloudera, Linux, MacOS, Amazon S3
Methodologies: Agile, Waterfall
Testing Tools: HP-ALM 11.0
Languages: Scala, Python, R, ML, Java
Office Tools: MS Word, Excel, Outlook, PowerPoint
Hadoop Tools: Aorta, Automic, Sqoop, Flume, Hive, Apache Spark, HBase
BI Tools: Tableau, ThoughtSpot
Database: Oracle 11g/10g/9i and Microsoft SQL Server 2005/2008/2012
Confidential, Bentonville, AR
Big Data Engineer
- Develop HQL queries to perform DDL and DML operations.
- Develop transformation scripts using Spark SQL to implement logic on data lake data.
- Develop workflows between Hadoop, Hive, and MariaDB using the Aorta framework.
- Schedule jobs on Automic to automate Aorta framework workflows.
- Design and develop data pipelines using YAML scripts and the Aorta framework.
- Ingest data from MariaDB into Hadoop using the Aorta framework.
- Perform statistical analysis of data arriving through Kafka.
- Create, validate, and maintain Aorta scripts to load data from various data sources into HDFS.
- Create external tables to read data into Hive from RDBMS sources.
- Develop shell scripts to build a framework between Hive and MariaDB.
- Unit test Aorta jobs and Automic workflows.
- Develop data quality (DQ) scripts to validate and maintain data quality for downstream applications.
- Validate ThoughtSpot data against base tables in the data lake.
- Perform end-to-end testing of Opportunities and KPI applications.
Environment: Kafka, Hue, Hive, Hadoop, Scala, HQL, MariaDB, SQL, Hortonworks.
Confidential, Fremont, CA
- Worked on a live 60-node Hadoop cluster running CDH5.
- Worked with 23 TB of structured, semi-structured, and unstructured data.
- Developed a Sqoop incremental import job for importing data into HDFS.
- Moved data from HDFS into the Spark shell to process stock data.
- Moved data from Amazon S3 into Spark for processing.
- Used YARN as the resource manager for parallel computing.
- Performed Spark transformations and actions on different file formats such as Parquet, Avro, JSON, and text files.
- Exported new datasets into staging tables using MariaDB.
- Developed Scala programs for Spark (RDDs, DataFrames, and Spark SQL) to handle mathematical calculations and deployed them on the cluster.
- Developed Scala programs for Spark Streaming, using DStreams to handle live data and analyze market trends.
Environment: Hadoop Ecosystem, HDFS, MapReduce, Sqoop, Spark, Spark SQL, Scala, R, Amazon S3, MariaDB, YARN, Cloudera.
Confidential, San Francisco, CA
- Built data pipelines to load and transform large sets of structured, semi-structured, and unstructured data.
- Imported data from HDFS into Hive using HiveQL.
- Involved in creating Hive tables and in loading and analyzing data using Hive queries.
- Created partitioned and bucketed Hive tables to improve performance.
- Developed a Sqoop import job for importing data into HDFS.
- Explored Spark components such as SparkContext, Spark SQL, DataFrames, pair RDDs, and accumulators to improve the performance of existing algorithms.
- Processed millions of records using Hadoop jobs.
- Implemented RDD transformations and actions in Python for Spark applications.
- Used Java UDF JAR files to mask PII data.
- Defined and contributed to the development of standards, guidelines, design patterns, and common development frameworks and components.
- Worked with leadership to understand scope, derive estimates, schedule and allocate work, manage tasks and projects, and present status updates to IT and business leaders as required.
Environment: Hadoop Ecosystem, Hive, HDFS, Python, Eclipse, Sqoop, Apache Spark, Java.
- Analyze Nielsen data to forecast television show impressions using time series analysis (predictive analysis) in R.
- Assess data quality using Python scripts and provide insights using pandas.
- Provide metrics to the upfront team (ad sales) through a Tableau dashboard that helps them with ad revenue.
- Perform statistical analysis using R.
- Monitor scheduled jobs on Databricks.
- Interface with other technology teams to extract, transform, and load data from a wide variety of data sources using SQL and other ETL solutions.
- Implement feature engineering logic in pandas on a subset of data to be used for forecasting in AWS.
Environment: Python, Tableau, R, Machine Learning.
- Implemented an automated ETL process using big data technologies, which enabled users to view dashboard analytics.
- Identified trends in crowdfunding campaign data using statistical analysis in R to provide recommendations to investors based on their interests.
- Extracted all crowdfunding campaigns from crowdfunding websites (Indiegogo, Kickstarter, etc.).
- Performed regression analysis on crowdfunding campaigns to predict the funding goal.
- Designed the database for spot crowd projects, a web application for investors, business owners, investment brokers, and business service providers, with a list of companies that have projects at various levels.
- Used big data technologies to process the data and store it in Hive tables.
Environment: R, Talend Big Data, Python, MySQL, Hive