
Big Data Engineer Resume


MD

SUMMARY

  • 8+ years of IT experience across a variety of industries working on Big Data, using Cloudera and Hortonworks distributions. Hadoop working environment includes Hadoop, Spark, MapReduce, Kafka, Hive, Ambari, Sqoop, HBase, and Impala.
  • Fluent programming experience with Java, Python, SQL, T-SQL, and R.
  • Hands-on experience in developing and deploying enterprise-based applications using major Hadoop ecosystem components like MapReduce, YARN, Hive, HBase, Flume, Sqoop, Spark MLlib, Spark GraphX, Spark SQL, Kafka.
  • Adept at configuring and installing Hadoop/Spark Ecosystem Components.
  • Worked with Cloudera and Hortonworks distributions.
  • Experience in working with Flume and NiFi for loading log files into Hadoop.
  • Proficient with Spark Core, Spark SQL, Spark MLlib, Spark GraphX and Spark Streaming for processing and transforming complex data using in-memory computing capabilities written in Scala. Worked with Spark to improve the efficiency of existing algorithms using Spark Context, Spark SQL, Spark MLlib, DataFrames, Pair RDDs and Spark on YARN.
  • Experience in designing Star and Snowflake schemas for Data Warehouse and ODS architectures.
  • Experience integrating various data sources such as Oracle SE2, SQL Server, flat files and unstructured files into a data warehouse.
  • Able to use Sqoop to migrate data between RDBMS, NoSQL databases and HDFS.
  • Experience in Extraction, Transformation and Loading (ETL) of data from various sources into Data Warehouses, as well as data processing such as collecting, aggregating and moving data from various sources using Apache Flume, Kafka, Power BI and Microsoft SSIS.
  • Experience with Azure transformation projects and Azure architecture decision making; architected and implemented ETL and data movement solutions using Azure Data Factory (ADF) and SSIS.
  • Good Experience in implementing and orchestrating data pipelines using Oozie and Airflow.
  • Hands-on experience with Hadoop architecture and various components such as Hadoop File System HDFS, Job Tracker, Task Tracker, Name Node, Data Node and Hadoop MapReduce programming.
  • Extensively used Python libraries: PySpark, Pytest, PyMongo, cx_Oracle, PyExcel, Boto3, Psycopg, embedPy, NumPy and Beautiful Soup.
  • Well versed experience in Amazon Web Services (AWS) Cloud services like EC2, S3.
  • Comprehensive experience in developing simple to complex MapReduce and Streaming jobs using Scala and Java for data cleansing, filtering and data aggregation. Also possess detailed knowledge of the MapReduce framework.
  • Seasoned practice in Machine Learning algorithms and Predictive Modeling such as Linear Regression, Logistic Regression, Naïve Bayes, Decision Tree, Random Forest, KNN, Neural Networks, and K-means Clustering.
  • Ample knowledge of data architecture including data ingestion pipeline design, Hadoop/Spark architecture, data modeling, data mining, machine learning and advanced data processing.
  • Experience working with NoSQL databases like Cassandra and HBase and developed real-time read/write access to very large datasets via HBase.
  • Developed Spark Applications that can handle data from various RDBMS (MySQL, Oracle Database) and Streaming sources.
  • Experience in analyzing data using HiveQL, Pig, HBase and custom MapReduce programs in Java 8.
  • Good Knowledge in Amazon AWS concepts like EMR and EC2 web services which provides fast and efficient processing of Big Data.
  • Expertise working with AWS cloud services like EMR, S3, Redshift and CloudWatch for big data development.

TECHNICAL SKILLS

Big Data Ecosystem: HDFS, MapReduce, Pig, Hive, Impala, YARN, HUE, Oozie, Zookeeper, Apache Spark, Apache Storm, Apache Kafka, Sqoop, Flume.

Operating Systems: Windows, Ubuntu, Red Hat Linux, Unix

Programming Languages: C, C++, Java, Python

Scripting Languages: Shell Scripting, JavaScript

Databases: Oracle 11g/10g/9i, MySQL, DB2, MS-SQL Server, SQL, PL/SQL, Teradata

NoSQL Databases: HBase, Cassandra, and MongoDB

Hadoop Distributions: Cloudera, Hortonworks

Build Tools: Ant, Maven, sbt

Development IDEs: NetBeans, Eclipse IDE

Web Servers: WebLogic, WebSphere, Apache Tomcat 6

Cloud: AWS

Version Control Tools: SVN, Git, GitHub

Packages: Microsoft Office, PuTTY, MS Visual Studio

PROFESSIONAL EXPERIENCE

Big Data Engineer

Confidential - MD

Responsibilities:

  • Used Spark Streaming to receive real-time data from Kafka and stored the streamed data in HDFS using Python and NoSQL databases such as HBase and Cassandra (see the streaming sketch after this list).
  • Collected data using Spark Streaming from an AWS S3 bucket in near-real-time, performed the necessary transformations and aggregations on the fly to build the common learner data model, and persisted the data in HDFS.
  • Developed Spark applications using PySpark and Spark SQL for data extraction, transformation and aggregation from multiple file formats.
  • Developed automated regression scripts in Python for validation of the ETL process between multiple databases such as AWS Redshift, Oracle, MongoDB and SQL Server (T-SQL).
  • Used AWS EMR to transform and move large amounts of data into and out of other AWS data stores and databases, such as Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB.
  • Installing, configuring and maintaining Data Pipelines
  • Extracted files from Hadoop and dropped them into S3 on a daily/hourly basis.
  • Authored Python (PySpark) scripts with custom UDFs for row/column manipulations, merges, aggregations, stacking, data labeling and all cleaning and conforming tasks.
  • Wrote Pig scripts to generate MapReduce jobs and performed ETL procedures on the data in HDFS.
  • Developed solutions to leverage ETL tools and identified opportunities for process improvements using Informatica and Python.
  • Integrated Apache Airflow with AWS to monitor multi-stage ML workflows with the tasks running on Amazon SageMaker.
  • Designed and implemented multiple ETL solutions with various data sources through extensive SQL scripting, ETL tools, Python, shell scripting and scheduling tools. Performed data profiling and data wrangling of XML, web feeds and files using Python, Unix and SQL.
  • Loaded data from different sources into a data warehouse and performed data aggregations for business intelligence using Python.
  • Designed and implemented Sqoop incremental jobs to read data from DB2 and load it into Hive tables, and connected Tableau to HiveServer2 for generating interactive reports.
  • Created functions and assigned roles in AWS Lambda to run Python scripts, and AWS Lambda functions in Java to perform event-driven processing (see the Lambda sketch after this list). Created Lambda jobs and configured roles using the AWS CLI.
  • Worked on Dimensional and Relational Data Modeling using Star and Snowflake Schemas, OLTP/OLAP system, Conceptual, Logical and Physical data modeling using Erwin.
  • Used Sqoop to channel data between HDFS and different RDBMS sources.
  • Worked on Spark SQL, created Data frames by loading data from Hive tables and created prep data and stored in AWS S3.
  • Created AWS Lambda, EC2 instances provisioning on AWS environment and implemented security groups, administered Amazon VPC's.
  • Used Apache NiFi to copy data from the local file system to HDP. Thorough understanding of various modules of AML including Watch List Filtering, Suspicious Activity Monitoring, CTR, CDD, and EDD.
  • Started working with AWS for storing and holding terabytes of data for customer BI reporting tools.
  • Validated the test data in DB2 tables on Mainframes and on Teradata using SQL queries.
  • Created a serverless data ingestion pipeline on AWS using MSK (Kafka) and Lambda functions.
  • Developed applications using Java that read data from MSK (Kafka) and write it to DynamoDB.
  • Developed NiFi workflows to pick up data from a REST API server, the data lake and an SFTP server, and send it to the Kafka broker.
  • Used Oozie to automate data processing and data loading into the Hadoop Distributed File System (HDFS).
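
A minimal sketch of the Kafka-to-HDFS streaming flow referenced above, assuming Spark Structured Streaming; the broker address, topic name and HDFS paths are illustrative placeholders rather than the project's actual values:

```python
# Sketch: consume a Kafka topic and persist the stream to HDFS as Parquet.
# Requires the spark-sql-kafka connector package on the Spark classpath.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

# Read the raw event stream from Kafka (broker and topic are placeholders).
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "learner-events")
          .option("startingOffsets", "latest")
          .load())

# Kafka delivers key/value as binary; cast the payload to string before parsing.
payload = events.select(col("value").cast("string").alias("payload"))

# Persist to HDFS with checkpointing for fault tolerance.
query = (payload.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/learner/raw")
         .option("checkpointLocation", "hdfs:///checkpoints/learner")
         .outputMode("append")
         .start())

query.awaitTermination()
```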
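
A similarly hedged sketch of the event-driven Lambda processing mentioned above, assuming an S3 put notification as the trigger; the bucket and curated prefix are hypothetical:

```python
# Sketch: Lambda handler for S3 put events (bucket/prefix names are hypothetical).
import json
import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    # An S3 notification can carry several records; process them all.
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        # Copy the incoming object into a curated prefix for downstream jobs.
        s3.copy_object(
            Bucket=bucket,
            Key=f"curated/{key}",
            CopySource={"Bucket": bucket, "Key": key},
        )

    return {"statusCode": 200, "body": json.dumps("processed")}
```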

Environment: Cloudera Manager (CDH5), Hadoop, PySpark, HDFS, NiFi, Pig, Hive, AWS, Kafka, Scrum, Git, Sqoop, Oozie, Informatica, Tableau, OLTP, OLAP, HBase, Cassandra, SQL Server, Python, Shell Scripting, XML, Unix.

Sr. Data Engineer

Confidential - Phoenix, AZ

Responsibilities:

  • Transformed business problems into Big Data solutions and defined Big Data strategy and roadmap. Installed, configured and maintained data pipelines.
  • Developed the features, scenarios and step definitions for BDD (Behavior Driven Development) and TDD (Test Driven Development) using Cucumber, Gherkin and Ruby.
  • Designing the business requirement collection approach based on the project scope and SDLC methodology.
  • Created pipelines in ADF using Linked Services, Datasets and Pipelines to extract, transform and load data from different sources such as Azure SQL, Blob Storage and Azure SQL Data Warehouse, and to write data back.
  • Extracted files from Hadoop and dropped them into S3 on a daily/hourly basis. Worked with data governance and data quality to design various models and processes.
  • Involved in all the steps and scope of the project reference data approach to MDM, have created a Data Dictionary and Mapping from Sources to the Target in MDM Data Model.
  • Experience managing Azure Data Lake Storage (ADLS) and Data Lake Analytics, and an understanding of how to integrate with other Azure services. Knowledge of U-SQL.
  • Responsible for working with various teams on a project to develop analytics-based solution to target customer subscribers specifically.
  • Built a new CI pipeline with testing and deployment automation using Docker, Docker Swarm, Jenkins and Puppet. Utilized continuous integration and automated deployments with Jenkins and Docker.
  • Data visualization: Pentaho, Tableau, D3. Knowledge of numerical optimization, anomaly detection and estimation, A/B testing, statistics and Maple. Big data analysis experience using Hadoop, MapReduce, NoSQL, Pig/Hive, Spark/Shark, MLlib, Scala, NumPy, SciPy, Pandas and scikit-learn.
  • Developed frameworks and processes to analyze unstructured information. Assisted in Azure Power BI architecture design
  • Developed Spark applications in Python (PySpark) on a distributed environment to load a large number of CSV files with different schemas into Hive ORC tables (see the sketch after this list).
  • Transformed and analyzed the data using PySpark and Hive, based on ETL mappings.
  • Developed PySpark programs and created the data frames and worked on transformations.
  • Data integration: ingested, transformed and integrated structured data and delivered it to a scalable data warehouse platform, using traditional ETL (Extract, Transform, Load) tools and methodologies to collect data from various sources into a single data warehouse.
  • Applied various machine learning algorithms and statistical modeling, such as decision trees, text analytics, natural language processing (NLP), supervised and unsupervised learning, regression models, social network analysis, neural networks, deep learning, SVM and clustering, to identify volume using the scikit-learn package in Python, R and MATLAB. Collaborated with data engineers and software developers to develop experiments and deploy solutions to production.
  • Created and published multiple dashboards and reports using Tableau Server and worked on text analytics, Naive Bayes, sentiment analysis, creating word clouds and retrieving data from Twitter and other social networking platforms.
  • Worked on data that was a combination of unstructured and structured data from multiple sources and automated the cleaning using Python scripts.
  • Tackled a highly imbalanced fraud dataset using undersampling with ensemble methods, oversampling and cost-sensitive algorithms.
  • Improved fraud prediction performance by using random forest and gradient boosting for feature selection with Python scikit-learn.
  • Designed and developed architecture for data services ecosystem spanning Relational, NoSQL, and Big Data technologies.
  • Architected and implemented medium to large scale BI solutions on Azure using Azure Data Platform services (Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics).
  • Involved in unit testing the code and provided feedback to the developers. Performed unit testing of the application using NUnit.
  • Created and maintained SQL Server scheduled jobs, executing stored procedures for the purpose of extracting data from Oracle into SQL Server. Extensively used Tableau for customer marketing data visualization
  • Wrote production-level machine learning classification models and ensemble classification models from scratch using Python and PySpark to predict binary values for certain attributes in a given time frame.
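
A condensed sketch of the PySpark CSV-to-Hive-ORC load described above; the landing path, database and table names are placeholders, and the schema is inferred per load:

```python
# Sketch: load CSV files with varying schemas and append them to a Hive ORC table.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("csv-to-hive-orc")
         .enableHiveSupport()
         .getOrCreate())

# Read the day's CSV drop; header and schema inference handle the varying layouts.
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/data/landing/daily/*.csv"))

# Normalize column names so differing source schemas line up in Hive.
df = df.toDF(*[c.strip().lower().replace(" ", "_") for c in df.columns])

# Append into an ORC-backed table registered in the Hive metastore.
(df.write
 .format("orc")
 .mode("append")
 .saveAsTable("analytics.learner_events"))
```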

Environment: Hadoop, Kafka, Spark, Sqoop, Docker, Docker Swarm, Spark SQL, TDD, Spark Streaming, Hive, Pig, NoSQL, Impala, Oozie, HBase, Data Lake, Zookeeper.

Data Engineer

Confidential - NJ

Responsibilities:

  • Developed and implemented Apache NiFi across various environments and wrote QA scripts in Python for tracking files.
  • Deployed the initial Azure components like Azure Virtual Networks, Azure Application Gateway, Azure Storage and Affinity groups.
  • Developed data pipeline using Sqoop to ingest customer behavioral data and purchase histories into HDFS for analysis.
  • Delivered denormalized data for Power BI consumers for modeling and visualization from the produced layer in the Data Lake.
  • Wrote a Kafka REST API to collect events from the front end.
  • Involved in creating an HDInsight cluster in the Microsoft Azure Portal; also created Event Hubs and Azure SQL databases.
  • Tested Apache TEZ, an extensible framework for building high performance batch and interactive data processing applications, on Pig and Hive jobs.
  • Exposed transformed data in the Azure Databricks (Spark) platform as Parquet for efficient data storage.
  • Enhanced and optimized product Spark code to aggregate, group and run data mining tasks using the Spark framework.
  • Involved in running all the Hive scripts through Hive, Hive on Spark and some through Spark SQL.
  • Worked on product positioning and messaging that differentiate Hortonworks in the open-source space.
  • Experience in designing and developing applications leveraging MongoDB.
  • Involved in importing the real time data to Hadoop using Kafka and implemented the Oozie job for daily imports.
  • Created and maintained optimal data pipeline architecture in the Microsoft Azure cloud using Data Factory and Azure Databricks.
  • Involved in complete big data flow of the application starting from data ingestion from upstream to HDFS, processing and analyzing the data in HDFS.
  • Troubleshot Azure development, configuration and performance issues.
  • Interacted with multiple teams responsible for the Azure platform to fix Azure platform bugs.
  • Provided 24/7 on-call support for Azure configuration and performance issues.
  • Created partitioned and bucketed Hive tables in Parquet file format with Snappy compression and then loaded data into the Parquet Hive tables from Avro Hive tables (see the sketch after this list).
  • Responsible for importing data to HDFS using Sqoop from different RDBMS servers and exporting data using Sqoop to the RDBMS servers.
  • Used Jira for bug tracking and Bitbucket to check in and check out code changes.
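
A simplified sketch of the Avro-to-Parquet reload described above, issued through spark.sql; the database, table and column names are illustrative, and the bucketing clause is left out of the sketch:

```python
# Sketch: partitioned, Snappy-compressed Parquet Hive table reloaded from an
# Avro staging table. Database, table and column names are placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("avro-to-parquet")
         .enableHiveSupport()
         .getOrCreate())

# Allow dynamic partition inserts and request Snappy-compressed Parquet output.
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
spark.conf.set("spark.sql.parquet.compression.codec", "snappy")

spark.sql("""
    CREATE TABLE IF NOT EXISTS warehouse.orders_parquet (
        order_id BIGINT,
        customer_id BIGINT,
        amount DOUBLE
    )
    PARTITIONED BY (order_date STRING)
    STORED AS PARQUET
""")

# Reload the Parquet table from the Avro-backed staging table.
spark.sql("""
    INSERT OVERWRITE TABLE warehouse.orders_parquet PARTITION (order_date)
    SELECT order_id, customer_id, amount, order_date
    FROM staging.orders_avro
""")
```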

Environment: Azure, HDFS, YARN, MapReduce, Hive, Sqoop, Flume, Oozie, Kafka, Impala, Spark SQL, Spark Streaming, Eclipse, Oracle, Teradata, PL/SQL, UNIX Shell Scripting.

Data Engineer

Confidential

Responsibilities:

  • Experience creating and organizing HDFS over a staging area.
  • Imported Legacy data from SQL Server and Teradata into Amazon S3
  • As a part of data migration, wrote many SQL scripts to identify data mismatches and worked on loading the history data from Teradata SQL to Snowflake.
  • Wrote Python code to manipulate and organize data frames so that all attributes in each field were formatted identically.
  • Developed SQL scripts to upload, retrieve, manipulate and handle sensitive data (National Provider Identifier data, i.e., name, address, SSN, phone number) in Teradata, SQL Server Management Studio and Snowflake databases for the project.
  • Developed stored procedures/views in Snowflake and used them in Talend for loading dimensions and facts.
  • Developed merge scripts to UPSERT data into Snowflake from an ETL source (see the sketch after this list).
  • Utilized Pandas to create a data frame
  • Implemented UNIX scripts to define the use case workflow and to process the data files and automate the jobs.
  • Created bashrc files and all other xml configurations to automate the deployment of Hadoop VMs over AWS EMR.
  • Developed a raw layer of external tables within S3 containing copied data from HDFS.
  • Created a data service layer of internal tables in Hive for data manipulation and organization.
  • Exported Data into Snowflake by creating Staging Tables to load Data of different files from Amazon S3.
  • Compared the data in a leaf-level process across various databases when data transformation or data loading took place, and analyzed data quality after these loads (checking for any data loss or data corruption).
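
A minimal sketch of the Snowflake UPSERT (MERGE) pattern described above, assuming the snowflake-connector-python client; the connection parameters, table and column names are placeholders:

```python
# Sketch: MERGE (UPSERT) rows from a staging table into a target Snowflake table.
# Connection parameters and table/column names are illustrative placeholders.
import snowflake.connector

MERGE_SQL = """
MERGE INTO analytics.dim_provider AS tgt
USING staging.provider_stage AS src
    ON tgt.provider_id = src.provider_id
WHEN MATCHED THEN UPDATE SET
    tgt.name = src.name,
    tgt.address = src.address,
    tgt.updated_at = CURRENT_TIMESTAMP()
WHEN NOT MATCHED THEN INSERT (provider_id, name, address, updated_at)
    VALUES (src.provider_id, src.name, src.address, CURRENT_TIMESTAMP())
"""

conn = snowflake.connector.connect(
    account="my_account",      # placeholder
    user="etl_user",           # placeholder
    password="***",            # supplied via a secrets manager in practice
    warehouse="ETL_WH",
    database="EDW",
)
try:
    conn.cursor().execute(MERGE_SQL)
finally:
    conn.close()
```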

Environment: HDFS, AWS, SSIS, Snowflake, Hadoop, Hive, HBase, MapReduce, Spark, Sqoop, Pandas, MySQL, SQL Server, PostgreSQL, Teradata, Java, Unix, Python, Tableau, Oozie, Git.

Python Developer/ Data Engineer

Confidential

Responsibilities:

  • Hands-on experience with Python packages like NumPy, Pandas, SciPy and PyTables.
  • Understood the business process and developed a process flow for automating requests.
  • Built data pipelines in python to get useful insights from data and streamline the incoming data
  • Along with my team developed APIs using python to dump the array structures in the processor at the failure point.
  • Developed the APIs which monitor the memory subsystem using Python.
  • Represented the system hierarchy by illustrating the components and subcomponents using Python.
  • Developed a set of library functions over the system based on user needs.
  • Used Python-MySQL connector and MySQL Database package to query data stored in MySQL databases and retrieve required information for the client.
  • Scripted close to 100 Python and batch scripts that automate ETL jobs which need to be executed every hour.
  • Developed ETL scripts in Python to extract data from a database and update the resulting data into another database.
  • Developed data transition programs from DynamoDB to AWS Redshift (ETL process) using AWS Lambda, creating functions in Python for certain events based on use cases (see the sketch after this list).
  • Worked on AWS SQS to consume the data from S3 buckets.
  • Deployed Airflow on EC2 instances mounted to EFS as a central directory, with SQS as the broker, metadata stored in RDS and logs shipped to S3 buckets.
  • Integrated new tools and technology frameworks to accelerate the data integration process and empower the deployment of predictive analytics.
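
A sketch of the DynamoDB-to-Redshift transition mentioned above, assuming a DynamoDB Streams trigger that stages new items to S3 for a subsequent Redshift COPY; the bucket, prefix and attribute handling are hypothetical simplifications:

```python
# Sketch: Lambda triggered by a DynamoDB stream; new items are staged to S3 as
# JSON lines for a subsequent Redshift COPY. Bucket and prefix are placeholders.
import json
import boto3

s3 = boto3.client("s3")
STAGING_BUCKET = "etl-staging-bucket"   # placeholder

def lambda_handler(event, context):
    rows = []
    for record in event.get("Records", []):
        # Only propagate inserts and updates; the stream carries the new item image.
        if record["eventName"] in ("INSERT", "MODIFY"):
            image = record["dynamodb"]["NewImage"]
            # Flatten the DynamoDB attribute-value format, e.g. {"S": "abc"} -> "abc".
            rows.append({k: list(v.values())[0] for k, v in image.items()})

    if rows:
        body = "\n".join(json.dumps(r) for r in rows)
        key = f"redshift-staging/{context.aws_request_id}.json"
        s3.put_object(Bucket=STAGING_BUCKET, Key=key, Body=body.encode("utf-8"))
        # A scheduled Redshift COPY (or the Redshift Data API) loads the staged files.

    return {"staged": len(rows)}
```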

Environment: Python, MySQL, ETL, AWS, Airflow.
