Data Engineer Resume
Minneapolis, MN
SUMMARY
- Over 8 years of experience in big data analytics and distributed data processing with large-scale datasets using the big data technology stack: Apache Kafka, Spark, Python (PySpark), Scala, Hive, Impala, HBase, Hadoop Distributed File System (HDFS), and Cloudera (CDH).
- Experienced in writing real-time processing and core jobs using Spark Streaming with Kafka as the data pipeline system.
- Expert in HDFS, Kafka, Spark, Hive, Sqoop, MapReduce, YARN, HBase, Oozie and Zookeeper.
- In-depth understanding of Hadoop architecture and its components, including HDFS and MapReduce.
- Sound experience with AWS cloud services (EMR, EC2, RDS, EBS, S3, Kinesis, Lambda, Glue, Athena, Elasticsearch, SQS, DynamoDB, Redshift, ECS).
- Experience improving Spark performance and optimizing existing Hadoop algorithms using Spark Context, Spark SQL, and DataFrames.
- Strong knowledge of PySpark and Spark SQL analytical functions, and of extending functionality by writing custom UDFs (a short UDF sketch follows this summary).
- Extensively worked on Spark SQL and PySpark scripts and scheduled Spark jobs in Oozie workflows.
- Experience designing and implementing fast and efficient data acquisition using Big Data processing techniques and tools.
- Experience in building data pipelines for data collection, storage and processing of data.
- Good experience with data visualization tools such as Kibana and Tableau for displaying graphs.
- Experience in using Amazon Web Services (AWS) in creating EC2 instances and S3 storage.
- ETL transformations using AWS Glue and AWS Lambda to trigger & process events.
- Working knowledge of Spark MLlib, using linear regression, naive Bayes, and other machine learning algorithms.
- Experience handling Python and Spark contexts when writing PySpark programs for ETL.
- Experience in real-time data streaming using NiFi and Kafka.
- Extensively used Spark SQL and Python APIs for querying and transforming data in Hive using DataFrames.
- Experience using the Stackdriver service and Dataproc clusters in GCP to access logs for debugging.
- Experienced in migrating from on-premises to AWS using AWS Data Pipeline and AWS Firehose.
- Experience creating REST APIs and performing CRUD operations such as POST, PUT, and GET requests using curl.
- Knowledge of both relational databases (RDBMS) such as MySQL and PostgreSQL, and NoSQL databases such as MongoDB and Cassandra.
- Experience working with Hadoop clusters using Cloudera and AWS EMR.
- Good knowledge of SQL and experience in building queries.
- Knowledge of SQL database design and development, including writing constraints, indexes, views, stored procedures, and triggers in MySQL.
- Experience with project management and bug tracking tools such as JIRA and Bugzilla.
- Experience with version control tools such as GIT, GitHub and SVN.
- Hands-on experience in Continuous Integration (CI) and Continuous Deployment (CD) using Jenkins, and in creating and scheduling jobs with Autosys and Airflow DAGs.
- Experience working with Docker components such as Docker Engine, Hub, Machine, Compose, and Registry; created Docker images and handled images primarily for middleware installation and domain configuration.
- Good experience in Agile development environments and Agile frameworks such as Scrum.
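A minimal PySpark sketch of the custom UDF pattern noted in the summary above, combining a Python UDF with a Spark SQL analytical function; the data, column, and view names are hypothetical placeholders, not from any specific project.

```python
# Illustrative only; data, column, and view names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-sketch").getOrCreate()

# Hypothetical input: customer records with a raw country code column.
df = spark.createDataFrame(
    [("c1", "us"), ("c2", "IN"), ("c3", None)],
    ["customer_id", "country_code"],
)

# Custom UDF: normalize country codes; nulls become "UNKNOWN".
@F.udf(returnType=StringType())
def normalize_country(code):
    return code.strip().upper() if code else "UNKNOWN"

df.withColumn("country", normalize_country("country_code")) \
  .createOrReplaceTempView("customers")

# Spark SQL analytical function over the UDF-derived column.
spark.sql("""
    SELECT country,
           COUNT(*)                             AS customers,
           RANK() OVER (ORDER BY COUNT(*) DESC) AS country_rank
    FROM customers
    GROUP BY country
""").show()
```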
TECHNICAL SKILLS
Big Data Ecosystems: Hadoop, MapReduce, Spark, HDFS, HBase, Pig, Hive, Sqoop, Kafka, Cloudera, Hortonworks, Oozie, NiFi, and Airflow.
Spark Technologies: Spark SQL, Spark DataFrames, and RDDs
Scripting Languages: Python and shell scripting
Programming Languages: Python, Scala, SQL, PL/SQL
Cloud Technologies: AWS (EMR, EC2, S3, Glue, Athena, Redshift), Azure, GCP, Docker
Databases: Oracle, MySQL and Microsoft SQL Server
NoSQL Technologies: HBase, MongoDB, Cassandra, DynamoDB
BI tools: Tableau, Kibana, Power BI
Web Technologies: SOAP and REST.
Other Tools: Eclipse, PyCharm, Git, ANT, Maven, Jenkins, SOAP UI, QC, Jira, Bugzilla
Methodologies: Agile /Scrum, Waterfall
Operating Systems: Windows, UNIX, Linux.
PROFESSIONAL EXPERIENCE
Confidential, Minneapolis, MN
Data Engineer
Responsibilities:
- Worked with Spark to improve performance and optimize existing Hadoop algorithms using Spark Context, Spark SQL, Spark MLlib, DataFrames, pair RDDs, and Spark on YARN.
- Worked with the data science team running machine learning models on a Spark EMR cluster and delivered on data needs as per business requirements.
- Automated the process of transforming and ingesting terabytes of monthly data in Parquet format using Kinesis, S3, Lambda, and Airflow.
- Used Spark Streaming APIs to perform transformations and actions on the fly to build a common learner data model that receives data from Kafka in near real time and persists it to Cassandra (see the streaming sketch following this role).
- Involved in building a data pipeline and performing analytics using the AWS stack (EMR, EC2, S3, RDS, Lambda, Kinesis, Athena, Glue, SQS, Redshift, and ECS).
- Connected Redshift to Tableau to create dynamic dashboards for the analytics team.
- Handled AWS management tools such as CloudWatch and CloudTrail.
- Loaded data into S3 buckets using AWS Glue and PySpark. Involved in filtering data stored in S3 buckets using Elasticsearch and loaded data into Hive external tables. Utilized Spark's in-memory capabilities to handle large datasets on the S3 data lake.
- Created programs in Python to handle PL/SQL constructs such as cursors and loops that are not supported by Snowflake.
- Developed Spark jobs on Databricks to perform tasks such as data cleansing, data validation, and standardization, and then applied transformations as per the use cases.
- Maintained Tableau functional reports based on user requirements.
- Developed Sqoop jobs for data ingestion and incremental data loads from RDBMS to Snowflake.
- Created action filters, parameters, and calculated sets for preparing dashboards and worksheets in Tableau.
- Created workflows using Airflow to automate the process of extracting weblogs into the S3 data lake.
- Involved in developing batch and stream processing applications that require functional pipelining using Spark Scala and the Streaming API.
- Involved in extracting and enriching multiple Cassandra tables using joins in Spark SQL; also converted Hive queries into Spark transformations.
- Involved in developing data processing tasks using PySpark, such as reading data from external sources, merging data, performing data enrichment, and loading into Confidential data destinations.
- Developed Python, PySpark, and Spark scripts to filter, cleanse, map, and aggregate data and to perform data ingestion.
- Fetched live data from an Oracle database using Spark Streaming and Amazon Kinesis using the feed from an API Gateway REST service.
- Performed ETL operations using Python, Spark SQL, S3, and Redshift on terabytes of data to obtain customer insights.
- Performed interactive analytics such as cleansing, validation, and quality checks on data stored in S3 buckets using AWS Athena.
- Involved in developing Docker images and deploying Docker containers in Swarm.
- Involved in writing Python scripts to automate ETL pipelines and DAG workflows using Airflow; managed communication between multiple services by distributing tasks to Celery workers.
- Integrated applications using Apache Tomcat servers on EC2 instances and automated data pipelines into AWS using Jenkins, Git, Maven, and Artifactory.
- Involved in moving raw data between different systems using Apache NiFi.
- Involved in writing unit tests; worked with the DevOps team on installing libraries and Jenkins agents and on productionizing ETL jobs and microservices.
- Involved in setting up CI/CD pipelines using Jenkins; wrote Groovy scripts to automate the Jenkins pipeline's integration and delivery service.
- Managed and deployed configurations for the entire datacenter infrastructure using Terraform.
- Worked with analytical reporting and facilitated data for QuickSight and Tableau dashboards.
- Used Git for version control and Jira for project management, tracking issues and bugs.
- Practiced and evangelized Agile development approaches. Wrote Ant scripts and assisted with build and configuration management processes.
Environment: Hadoop, Spark, PySpark, MapReduce, AWS, EC2, S3, EMR, Athena, Lambda, Glue, Elasticsearch, Spark Streaming, RDS, DynamoDB, Redshift, ECS, Hive, ETL, NiFi, NoSQL, Pig, Python, Scala, SQL, Sqoop, Kafka, Airflow, HBase, Oracle, Cassandra, Agile, QuickSight, Tableau, Maven, Jenkins, Docker, Git, Jira.
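A hedged sketch of the Kafka-to-Cassandra streaming pattern described in this role. The broker address, topic, schema, keyspace, and table names are assumed placeholders, and the DataStax Spark Cassandra connector is assumed to be on the cluster classpath.

```python
# Sketch only; broker, topic, keyspace, and table names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = (SparkSession.builder
         .appName("learner-stream-sketch")
         .config("spark.cassandra.connection.host", "cassandra-host")  # assumed host
         .getOrCreate())

event_schema = StructType([
    StructField("learner_id", StringType()),
    StructField("course_id", StringType()),
    StructField("event_time", TimestampType()),
])

# Read JSON events from Kafka in near real time.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")   # assumed broker
          .option("subscribe", "learner-events")                # assumed topic
          .load()
          .select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
          .select("e.*"))

# Persist each micro-batch to Cassandra via the DataStax connector.
def write_to_cassandra(batch_df, batch_id):
    (batch_df.write
     .format("org.apache.spark.sql.cassandra")
     .mode("append")
     .options(keyspace="learning", table="learner_events")      # assumed keyspace/table
     .save())

query = (events.writeStream
         .foreachBatch(write_to_cassandra)
         .option("checkpointLocation", "/tmp/checkpoints/learner-events")
         .start())
query.awaitTermination()
```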
Confidential, Chicago, IL
Azure Data Engineer
Responsibilities:
- Designed and configured Azure cloud relational servers and databases based on analysis of current and future business requirements.
- Worked on migration of data from on-prem SQL Server to cloud databases (Azure Synapse Analytics (DW) and Azure SQL DB).
- Followed agile methodology for the entire project.
- Involved in setting up separate application and reporting data tiers across servers using geo-replication functionality.
- Implemented Disaster Recovery and Failover servers in Cloud by replicating data across regions.
- Extensive experience creating pipeline jobs, scheduling triggers, and mapping data flows using Azure Data Factory (V2), and using Key Vault to store credentials.
- Wrote UDFs in Scala and PySpark to meet specific business requirements.
- Analyzed data from different sources with the Hadoop big data solution by implementing Azure Data Factory, Azure Data Lake, Azure Synapse, Azure Data Lake Analytics, HDInsight, Hive, and Sqoop.
- Involved in developing Spark Streaming jobs by writing RDDs and developing DataFrames using Spark SQL as needed.
- Used Kusto Explorer for log analytics and better query response, and created alerts using Kusto Query Language.
- Involved in importing real-time data into Hadoop using Kafka and implemented Oozie jobs for daily data.
- Developed and deployed the outcome using Spark and Scala code on a Hadoop cluster running on GCP.
- Involved in Azure Site Recovery and Azure Backup; configured Azure Backup vaults to protect required VMs and take VM-level backups for Azure and on-premises environments.
- Worked on creating tabular models on Azure Analysis Services to meet business reporting requirements.
- Developed ETL processes using PySpark, using both the DataFrame API and the Spark SQL API (see the PySpark ETL sketch following this role).
- Experienced in performance tuning of Spark applications: setting the right batch interval, choosing the correct level of parallelism, and memory tuning.
- Performed ETL using Azure Databricks.
- Worked with Azure Blob and Data Lake storage and loaded data into Azure Synapse Analytics (DW).
- Involved in loading and transforming large sets of structured and semi-structured data from multiple data sources into the Raw Data Zone (HDFS) using Sqoop imports and Spark jobs.
- Implemented an ETL framework using Spark with Python and loaded standardized data into Hive and HBase tables.
- Built data pipelines in Airflow on GCP for ETL-related jobs using different Airflow operators.
- Wrote and executed various MySQL database queries from Python using the Python-MySQL connector and the MySQLdb package.
- Involved in converting Hive/SQL queries into Spark Transformations using Spark RDDs and Scala.
- Collecting and aggregating large amounts of log data using Flume and staging data in HDFS for further analysis
- Analyzed the SQL scripts and designed the solution for implementation using PySpark.
- Worked on creating correlated and non-correlated sub-queries to resolve complex business queries involving multiple tables from different databases.
- Developed business intelligence solutions using SQL server data tools and load data to SQL & Azure Cloud databases.
- Performed data quality analyses and applied business rules in all layers of the data extraction, transformation, and loading process.
- Involved in downloading BigQuery data into pandas or Spark DataFrames for advanced ETL capabilities.
- Perform validation and verify software at all testing phases which includes Functional Testing, System Integration Testing, End to End Testing, Regression Testing, Sanity Testing, User Acceptance Testing, Smoke Testing, Disaster Recovery Testing, Production Acceptance Testing and Pre-prod Testing phases.
- Worked on Tableau to build customized interactive reports, worksheets and dashboards.
- Have good experience logging defects in Jira and Azure DevOps.
- Involved in planning cutover strategy, go-live schedule including the scheduled release dates of Portfolio central Datamart changes.
- Automated tasks using PowerShell.
Environment: Spark, Scala, Microsoft SQL Server, Azure Synapse Analytics, Azure Data Lake & Blob, Azure SQL, Azure Data Factory (ADF), GCP, NoSQL, ETL, Azure Analysis Services, Python, BIDS.
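A hedged sketch of the Databricks-style PySpark ETL described in this role, using both the DataFrame API and Spark SQL; the storage account, container paths, and column names are hypothetical placeholders.

```python
# Sketch only; storage paths and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("adls-etl-sketch").getOrCreate()

# DataFrame API: read raw semi-structured data from a (hypothetical) ADLS path.
raw = spark.read.json("abfss://raw@examplelake.dfs.core.windows.net/sales/")

# Cleansing / standardization: drop bad rows, normalize types and column names.
clean = (raw
         .dropna(subset=["order_id"])
         .withColumn("order_date", F.to_date("order_date", "yyyy-MM-dd"))
         .withColumn("amount", F.col("amount").cast("decimal(18,2)"))
         .withColumnRenamed("cust_id", "customer_id"))

clean.createOrReplaceTempView("sales_clean")

# Spark SQL API: apply a business rule and aggregate before loading downstream.
daily = spark.sql("""
    SELECT order_date, customer_id, SUM(amount) AS daily_amount
    FROM sales_clean
    WHERE amount > 0
    GROUP BY order_date, customer_id
""")

# Write the standardized output to a curated zone (hypothetical path).
(daily.write
 .mode("overwrite")
 .partitionBy("order_date")
 .parquet("abfss://curated@examplelake.dfs.core.windows.net/sales_daily/"))
```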
Confidential, NY
Data Engineer
Responsibilities:
- Developed data pipelines using Sqoop, Pig and Hive to ingest customer data into HDFS to perform data analytics.
- Familiar with data architecture including data ingestion pipeline design, data modelling and data mining.
- Developed Hive queries to pre-process the data required for running the business process.
- Worked on developing ETL processes to load data from multiple data sources to HDFS using SQOOP.
- Configured Spark Streaming to receive ongoing information from Kafka and store the stream data in HDFS.
- Imported data from different sources such as HDFS and HBase into Spark RDDs and performed computations using PySpark to generate the output response.
- Used Talend for big data integration with Spark and Hadoop.
- Developed Spark scripts and UDFs using both the Spark DSL and Spark SQL queries for data aggregation, querying, and writing data back into RDBMS through Sqoop.
- Wrote multiple MapReduce jobs using the Java API and Pig for data extraction, transformation, and aggregation from multiple file formats including Parquet, Avro, XML, JSON, CSV, and ORC.
- Designed and developed a security framework to provide fine-grained access to objects in AWS S3 using AWS Lambda and DynamoDB (NoSQL).
- Developed a detailed project plan and helped manage the data conversion migration from the legacy system to the Confidential Snowflake database.
- Used the DataStax Spark connector to store data into and retrieve data from the Cassandra database.
- Utilized Kubernetes and Docker as the runtime environment of the CI/CD system to build, test, and deploy.
- Involved in loading and transforming large sets of structured data from router location to EDW using an Apache NiFi data pipeline flow.
- Responsible for analysis of requirements and designed generic and standard ETL process to load data from different source systems.
- Performed end-to-end architecture and implementation assessment of various AWS cloud services such as Amazon EMR, Redshift, S3, IAM, RDS, CloudWatch, and Athena.
- Performed statistical data analysis and data visualization using Python.
- Worked on MongoDB (NoSQL) for distributed storage and processing.
- Used AWS EMR to transform and move large amounts of data into and out of other AWS data stores and databases, such as Amazon Simple Storage Service and Amazon DynamoDB.
- Created Lambda functions with Boto3 to deregister unused AMIs in all application regions to reduce EC2 costs (see the Boto3 sketch following this role).
- Implemented a POC to migrate MapReduce jobs to Spark RDD transformations using Python.
- Transformed the data using AWS Glue dynamic frames with PySpark; cataloged the transformed data using crawlers and scheduled the job and crawler using the workflow feature.
- Developed a reusable framework, to be leveraged for future migrations, that automates ETL from RDBMS systems to the data lake utilizing Spark data sources and Hive data objects.
- Conducted Data blending, Data preparation using Alteryx and SQL for Tableau consumption and publishing data sources to Tableau server.
- Worked on scheduling all jobs using Airflow scripts in Python, adding different tasks to DAGs and Lambda.
- Developed Kibana dashboards based on Logstash data and integrated different source and Confidential systems into Elasticsearch for near-real-time log analysis and end-to-end transaction monitoring.
- Deployed new hardware and software environments required for PostgreSQL/Hadoop and expanded existing environments.
- Implemented AWS Step Functions to automate and orchestrate Amazon SageMaker tasks such as publishing data to S3, training the ML model, and deploying it for prediction.
- Used Jenkins for CI/CD, Docker as a container tool and Git as a version control tool.
- Followed agile methodology including, test-driven and pair-programming concept.
Environment: Hadoop, Spark, Scala, AWS EMR, S3, RDS, Redshift, Lambda, Boto3, DynamoDB (NoSQL), SageMaker, Glue, Athena, HBase, ETL, HDFS, Kafka, Hive, Sqoop, MapReduce, Pig, Python, Agile, Tableau.
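A hedged sketch of the Boto3 Lambda for deregistering unused AMIs mentioned in this role. "Unused" is interpreted here as a self-owned AMI not referenced by any existing EC2 instance in the region, which may differ from the project's actual criteria.

```python
# Sketch only; "unused" criteria are an assumption, not the project's exact rule.
import boto3

def lambda_handler(event, context):
    ec2 = boto3.client("ec2")

    # Collect AMIs currently referenced by instances in this region.
    in_use = set()
    paginator = ec2.get_paginator("describe_instances")
    for page in paginator.paginate():
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                in_use.add(instance["ImageId"])

    # Deregister self-owned AMIs that no instance references.
    deregistered = []
    for image in ec2.describe_images(Owners=["self"])["Images"]:
        if image["ImageId"] not in in_use:
            ec2.deregister_image(ImageId=image["ImageId"])
            deregistered.append(image["ImageId"])

    return {"deregistered": deregistered}
```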
Confidential
Data Engineer
Responsibilities:
- Involved in the installation, configuration, design, development, and maintenance of a Hadoop cluster with several tools, following a complete software development life cycle under an Agile methodology.
- Worked on the latest versions of Hadoop distributions such as the Hortonworks Distribution.
- Worked on both batch and streaming data processing, with ingestion to NoSQL and HDFS in different file formats such as Parquet and Avro.
- Worked on integration of Kafka with Spark streaming for high-speed data processing.
- Developed multiple Kafka producers and consumers as per business requirements and customized partitioning to get optimized results.
- Involved in big data requirements analysis and in designing and developing solutions for ETL platforms.
- Worked on data pipelines as per the business requirements and scheduled it using Oozie schedulers.
- Worked on advanced procedures like text analytics and processing, using the in-memory computing capabilities of Spark using Scala and Python as per requirements.
- Worked on cluster coordination with data capacity planning and node forecasting using ZooKeeper.
- Worked on experimental Spark APIs for better optimization of existing algorithms, including Spark Context, Spark SQL, Spark Streaming, and Spark DataFrames.
- Involved in configuration and development of the Hadoop environment with AWS cloud services such as EC2, EMR, Redshift, Route 53, and CloudWatch.
- Worked on AWS Data Pipeline to configure data loads from S3 into Redshift.
- Developed a Python script to hit REST APIs and extract data to AWS S3 (see the extraction sketch following this role).
- Involved in developing data processing tasks using PySpark, such as reading data from external sources, merging data, performing data enrichment, and loading into Confidential data destinations.
- Developed Python, PySpark, and Spark scripts to filter, cleanse, map, and aggregate data and to perform data ingestion.
- Worked on Spark and MLlib to develop a linear regression model for logistic information.
- Worked on exporting and analyzing data to the RDBMS for visualization and to generate reports for the BI team.
- Supported in setting up QA environment and updating configurations for implementing scripts.
Environment: Scala, Spark SQL, Spark Streaming, Spark DataFrames, EC2, EMR, Redshift, CloudWatch, S3, Spark MLlib, PySpark, ETL, HDFS, Hive, Sqoop, Kafka, Shell Scripting, Cassandra (NoSQL), Python, AWS, Tableau, SQL Server, GitHub, Maven.
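A hedged sketch of the REST-to-S3 extraction script mentioned in this role; the endpoint URL, bucket, and object key are hypothetical placeholders.

```python
# Sketch only; endpoint URL, bucket, and key are hypothetical placeholders.
import json
import requests
import boto3

API_URL = "https://api.example.com/v1/orders"    # assumed endpoint
BUCKET = "example-raw-data"                      # assumed bucket

def extract_to_s3():
    # Hit the REST API; raise on non-2xx responses.
    response = requests.get(API_URL, timeout=30)
    response.raise_for_status()
    records = response.json()

    # Land the raw payload in S3 as JSON for downstream processing.
    s3 = boto3.client("s3")
    s3.put_object(
        Bucket=BUCKET,
        Key="rest_extracts/orders.json",
        Body=json.dumps(records).encode("utf-8"),
    )

if __name__ == "__main__":
    extract_to_s3()
```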
Confidential
Software Engineer
Responsibilities:
- Developed several advanced MapReduce programs to process received data files.
- Developed MapReduce programs for data analysis and data cleaning (see the Hadoop Streaming sketch following this role).
- Developed data pipelines using Spark, Pig, Python, Impala, and HBase to ingest customer behavioral data and financial histories into the Hadoop cluster for analysis.
- Worked on various summarization patterns to calculate aggregate statistical values over datasets.
- Spark Streaming collects data from Kafka in near real time and performs the necessary transformations and aggregations to build the common learner data model, storing the data in a NoSQL store (HBase).
- Involved in implementing joins in the analysis of dataset to discover interesting relationships.
- Completely involved in the requirement analysis phase.
- Involved in migration of ETL processes from Oracle to Hive to test easy data manipulation.
- Extended Hive and Pig core functionality by writing custom UDFs.
- Worked on partitioning Hive tables and running the scripts in parallel to reduce script run time.
- Worked with internal and external Hive tables and created Hive tables to store processed results in tabular format.
- Implemented partitioning, dynamic partitions, and buckets in Hive.
- Developed Pig scripts and Pig UDFs to load data files into Hadoop.
- Analyzed the data by performing Hive queries and running Pig scripts.
- Developed Pig Latin scripts for the analysis of semi-structured and unstructured data.
- Worked on the process of creating complex data pipelines using transformations, aggregations, cleansing and filtering.
- Involved in writing cron jobs to run at regular intervals.
- Developed MapReduce jobs for Log Analysis, Recommendation and Analytics.
- Used Flume to efficiently collect, aggregate, and move large amounts of log data.
- Involved in loading data from edge nodes to HDFS using shell scripting.
- Worked with application teams to install operating system and Hadoop updates, patches, and version upgrades as required.
- Created functions and assigned roles in AWS Lambda to run Python scripts, and used AWS Lambda with Java to perform event-driven processing.
- Involved in managing and reviewing Hadoop log files.
- Responsible for cluster maintenance: adding and removing cluster nodes, cluster monitoring and troubleshooting, and managing and reviewing data backups and Hadoop log files.
Environment: Hadoop, Spark, Kafka, Scala, Spark Streaming, Python, NoSQL, Apache Pig, Apache Hive, MapReduce, HDFS, Flume, GIT, ETL, UNIX Shell scripting, PostgreSQL, Linux, Agile.
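A hedged sketch of a log-analysis MapReduce job expressed as a Hadoop Streaming job in Python (counting log lines per severity level); the log format and field positions are hypothetical, and the original jobs may have been written against a different API.

```python
# Sketch only; assumes log lines like "2023-01-01 12:00:00 ERROR something failed".
import sys

def mapper():
    # Emit (level, 1) for every log line.
    for line in sys.stdin:
        parts = line.split()
        if len(parts) >= 3:
            print(f"{parts[2]}\t1")

def reducer():
    # Hadoop Streaming sorts by key, so counts can be accumulated per level.
    current_level, count = None, 0
    for line in sys.stdin:
        level, value = line.rstrip("\n").split("\t")
        if level != current_level:
            if current_level is not None:
                print(f"{current_level}\t{count}")
            current_level, count = level, 0
        count += int(value)
    if current_level is not None:
        print(f"{current_level}\t{count}")

if __name__ == "__main__":
    # Run via hadoop-streaming with -mapper "script.py map" -reducer "script.py reduce".
    mapper() if sys.argv[1] == "map" else reducer()
```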